1 Metrics

Labeling factors and exploring scales.
back to Appendix

1.1 Data preliminaries

Initial point of departure - the databox of the selected variables, described in the Methods chapter.
Figure 3.2 This databox corresponds to the dataset dsL produced by Derive_dsL_from_Extract report, given in the Appendix.

dsL<-readRDS("./Data/Derived/dsL.rds")

Figure 3.3

Note that the variable year serves as a natural devided between time invariant (TIvars) and time variant (TVvars) variables. All modeling operations beging with subsetting this dataset. For the grammer rules of operations with relevant data see Data Manipulation Guide.

1.2 Labeling Factor Levels

Review of the item reference cards shows that initially, all items were recorded on some discrete scale, either counting occasions or assigning an intiger to a category of response. However, data were saved as numerical values or intigers

require(dplyr)
ds<- dsL %>%
  dplyr::select(
        sample, id, sex, race, bmonth,byear, attendPR, relprefPR,relraisedPR,
    year,
        agemon, ageyear, famrel, attend,
        values, todo, obeyed, pray, decisions, 
        relpref, bornagain, faith, 
        calm, blue, happy, depressed, nervous,
        tv, computer, internet)               
str(ds)
'data.frame':   134745 obs. of  30 variables:
 $ sample     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ id         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex        : int  2 2 2 2 2 2 2 2 2 2 ...
 $ race       : int  4 4 4 4 4 4 4 4 4 4 ...
 $ bmonth     : int  9 9 9 9 9 9 9 9 9 9 ...
 $ byear      : int  1981 1981 1981 1981 1981 1981 1981 1981 1981 1981 ...
 $ attendPR   : int  7 7 7 7 7 7 7 7 7 7 ...
 $ relprefPR  : int  21 21 21 21 21 21 21 21 21 21 ...
 $ relraisedPR: int  21 21 21 21 21 21 21 21 21 21 ...
 $ year       : int  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
 $ agemon     : num  190 206 219 231 243 256 266 279 290 302 ...
 $ ageyear    : num  15 17 18 19 20 21 22 23 24 25 ...
 $ famrel     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ attend     : num  NA NA NA 1 6 2 1 1 1 1 ...
 $ values     : num  NA NA NA NA NA 1 NA NA 0 NA ...
 $ todo       : num  NA NA NA NA NA 1 NA NA 1 NA ...
 $ obeyed     : num  NA NA NA NA NA 1 NA NA 0 NA ...
 $ pray       : num  NA NA NA NA NA 0 NA NA 0 NA ...
 $ decisions  : num  NA NA NA NA NA 1 NA NA 1 NA ...
 $ relpref    : num  NA NA NA NA NA NA NA NA 21 NA ...
 $ bornagain  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ faith      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ calm       : num  NA NA NA 3 NA 4 NA 4 NA 4 ...
 $ blue       : num  NA NA NA 3 NA 2 NA 1 NA 1 ...
 $ happy      : num  NA NA NA 3 NA 3 NA 4 NA 4 ...
 $ depressed  : num  NA NA NA 3 NA 2 NA 1 NA 1 ...
 $ nervous    : num  NA NA NA 3 NA 1 NA 1 NA 1 ...
 $ tv         : num  NA NA NA NA NA 2 NA NA NA NA ...
 $ computer   : num  NA NA NA NA NA 5 NA NA NA NA ...
 $ internet   : num  NA NA NA NA NA NA 1 0 1 1 ...

LabelingFactorLevels.R sourced at the end of Derive_dsL_from_Extract matches numeric values with response labels from the questionnaire and adds copies of the variables, saved as labeled factors, to dsL. For estimations routines such as lme4 or graphing functions such as ggplot2, the data type (string,numeric, factor) is a meaningful input, so a quick access to both formats frequently proves useful. It is convenient to think that dsL contains only

ncol(dsL)/2
[1] 30

variables, but each of them has a double, a labeled factor.

str(dsL)
'data.frame':   134745 obs. of  60 variables:
 $ sample      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ id          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex         : int  2 2 2 2 2 2 2 2 2 2 ...
 $ race        : int  4 4 4 4 4 4 4 4 4 4 ...
 $ bmonth      : int  9 9 9 9 9 9 9 9 9 9 ...
 $ byear       : int  1981 1981 1981 1981 1981 1981 1981 1981 1981 1981 ...
 $ attendPR    : int  7 7 7 7 7 7 7 7 7 7 ...
 $ relprefPR   : int  21 21 21 21 21 21 21 21 21 21 ...
 $ relraisedPR : int  21 21 21 21 21 21 21 21 21 21 ...
 $ year        : int  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
 $ agemon      : num  190 206 219 231 243 256 266 279 290 302 ...
 $ ageyear     : num  15 17 18 19 20 21 22 23 24 25 ...
 $ famrel      : num  NA NA NA NA NA NA NA NA NA NA ...
 $ attend      : num  NA NA NA 1 6 2 1 1 1 1 ...
 $ values      : num  NA NA NA NA NA 1 NA NA 0 NA ...
 $ todo        : num  NA NA NA NA NA 1 NA NA 1 NA ...
 $ obeyed      : num  NA NA NA NA NA 1 NA NA 0 NA ...
 $ pray        : num  NA NA NA NA NA 0 NA NA 0 NA ...
 $ decisions   : num  NA NA NA NA NA 1 NA NA 1 NA ...
 $ relpref     : num  NA NA NA NA NA NA NA NA 21 NA ...
 $ bornagain   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ faith       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ calm        : num  NA NA NA 3 NA 4 NA 4 NA 4 ...
 $ blue        : num  NA NA NA 3 NA 2 NA 1 NA 1 ...
 $ happy       : num  NA NA NA 3 NA 3 NA 4 NA 4 ...
 $ depressed   : num  NA NA NA 3 NA 2 NA 1 NA 1 ...
 $ nervous     : num  NA NA NA 3 NA 1 NA 1 NA 1 ...
 $ tv          : num  NA NA NA NA NA 2 NA NA NA NA ...
 $ computer    : num  NA NA NA NA NA 5 NA NA NA NA ...
 $ internet    : num  NA NA NA NA NA NA 1 0 1 1 ...
 $ sampleF     : Ord.factor w/ 2 levels "Cross-Sectional"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ idF         : Factor w/ 8983 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sexF        : Ord.factor w/ 3 levels "Male"<"Female"<..: 2 2 2 2 2 2 2 2 2 2 ...
 $ raceF       : Ord.factor w/ 4 levels "Black"<"Hispanic"<..: 4 4 4 4 4 4 4 4 4 4 ...
 $ bmonthF     : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 9 9 9 9 9 9 9 9 9 9 ...
 $ byearF      : Factor w/ 5 levels "1980","1981",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ attendPRF   : Ord.factor w/ 8 levels "Never"<"Once or Twice"<..: 7 7 7 7 7 7 7 7 7 7 ...
 $ relprefPRF  : Ord.factor w/ 33 levels "Catholic"<"Baptist"<..: 21 21 21 21 21 21 21 21 21 21 ...
 $ relraisedPRF: Ord.factor w/ 33 levels "Catholic"<"Baptist"<..: 21 21 21 21 21 21 21 21 21 21 ...
 $ yearF       : Factor w/ 15 levels "1997","1998",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ agemonF     : Factor w/ 244 levels "146","147","148",..: 45 61 74 86 98 111 121 134 145 157 ...
 $ ageyearF    : Factor w/ 21 levels "12","13","14",..: 4 6 7 8 9 10 11 12 13 14 ...
 $ famrelF     : Factor w/ 8 levels "0","1","2","3",..: NA NA NA NA NA NA NA NA NA NA ...
 $ attendF     : Ord.factor w/ 8 levels "Never"<"Once or Twice"<..: NA NA NA 1 6 2 1 1 1 1 ...
 $ valuesF     : Ord.factor w/ 2 levels "FALSE/less Religious"<..: NA NA NA NA NA 2 NA NA 1 NA ...
 $ todoF       : Ord.factor w/ 2 levels "FALSE/less Religious"<..: NA NA NA NA NA 2 NA NA 2 NA ...
 $ obeyedF     : Ord.factor w/ 2 levels "FALSE/less Religious"<..: NA NA NA NA NA 2 NA NA 1 NA ...
 $ prayF       : Ord.factor w/ 2 levels "FALSE/less Religious"<..: NA NA NA NA NA 1 NA NA 1 NA ...
 $ decisionsF  : Ord.factor w/ 2 levels "FALSE/less Religious"<..: NA NA NA NA NA 2 NA NA 2 NA ...
 $ relprefF    : Ord.factor w/ 33 levels "Catholic"<"Baptist"<..: NA NA NA NA NA NA NA NA 21 NA ...
 $ bornagainF  : Ord.factor w/ 2 levels "NO"<"YES": NA NA NA NA NA NA NA NA NA NA ...
 $ faithF      : Ord.factor w/ 5 levels "Exrtemely"<"Very"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ calmF       : Ord.factor w/ 4 levels "All of the time"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ blueF       : Ord.factor w/ 4 levels "All of the time"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ happyF      : Ord.factor w/ 4 levels "All of the time"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ depressedF  : Ord.factor w/ 4 levels "All of the time"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ nervousF    : Ord.factor w/ 4 levels "All of the time"<..: NA NA NA NA NA NA NA NA NA NA ...
 $ tvF         : Ord.factor w/ 6 levels "less than 2"<..: NA NA NA NA NA 2 NA NA NA NA ...
 $ computerF   : Ord.factor w/ 6 levels "None"<"less than 1"<..: NA NA NA NA NA 5 NA NA NA NA ...
 $ internetF   : Ord.factor w/ 2 levels "No"<"Yes": NA NA NA NA NA NA 2 1 2 2 ...

This give a certain flexibity in assembling needed dataset quickly and have access to factor labels. One can alternate between the raw metric and labeled factor by adding “F” suffix to the end of the variable name:

ds<- dsL %>%
  dplyr::filter(id==25) %>%
  dplyr::select(id,byear,year, attend,attendF)
ds
   id byear year attend            attendF
1  25  1983 1997     NA               <NA>
2  25  1983 1998     NA               <NA>
3  25  1983 1999     NA               <NA>
4  25  1983 2000      5  About twice/month
5  25  1983 2001      7 Several times/week
6  25  1983 2002      7 Several times/week
7  25  1983 2003      2      Once or Twice
8  25  1983 2004      7 Several times/week
9  25  1983 2005      5  About twice/month
10 25  1983 2006      7 Several times/week
11 25  1983 2007      5  About twice/month
12 25  1983 2008      7 Several times/week
13 25  1983 2009      7 Several times/week
14 25  1983 2010      7 Several times/week
15 25  1983 2011      7 Several times/week

Having quick access to factor labels will be especially useful during graph production.

1.3 Time metrics : Age, Period, Cohort

NLSY97 sample includes individuals from five cohorts, born between 1980 and 1984.The following graphics shows how birth cohort, age of respondents, and round of observation are related in NSLY97.
Figure 3.1

1.3.1 Counts of respondents across age, period, and cohort

ds<- dsL %>%  # chose conditions to apply in creating the dataset 
  dplyr::filter(id %in% c(1:9022)) %.% # 1:9022
  dplyr::filter(year %in% c(2000:2011)) %.% # 1997:2011
  dplyr::filter(sample %in% c(0,1)) %.% # 0-Oversample; 1-Cross-Sectional
  dplyr::filter(race %in% c(1:4)) %.% # 1-Black; 2-Hispanis; 3-Mixed; 4-White
  dplyr::filter(byear %in% c(1980:1984)) %.%
  dplyr::mutate(age=year-byear) # define bin for age

1.3.1.1 Wide age

table(ds$byear, ds$age)
      
         16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
  1980    0    0    0    0 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691
  1981    0    0    0 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874    0
  1982    0    0 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841    0    0
  1983    0 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807    0    0    0
  1984 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770    0    0    0    0

1.3.1.2 Wide wave

table(ds$byear,ds$year)
      
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  1980 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691 1691
  1981 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874 1874
  1982 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841 1841
  1983 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807 1807
  1984 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770 1770

1.3.1.3 Long wave

table(ds$year,ds$byear)
      
       1980 1981 1982 1983 1984
  2000 1691 1874 1841 1807 1770
  2001 1691 1874 1841 1807 1770
  2002 1691 1874 1841 1807 1770
  2003 1691 1874 1841 1807 1770
  2004 1691 1874 1841 1807 1770
  2005 1691 1874 1841 1807 1770
  2006 1691 1874 1841 1807 1770
  2007 1691 1874 1841 1807 1770
  2008 1691 1874 1841 1807 1770
  2009 1691 1874 1841 1807 1770
  2010 1691 1874 1841 1807 1770
  2011 1691 1874 1841 1807 1770

1.3.1.4 Long age

table(ds$age,ds$byear)
    
     1980 1981 1982 1983 1984
  16    0    0    0    0 1770
  17    0    0    0 1807 1770
  18    0    0 1841 1807 1770
  19    0 1874 1841 1807 1770
  20 1691 1874 1841 1807 1770
  21 1691 1874 1841 1807 1770
  22 1691 1874 1841 1807 1770
  23 1691 1874 1841 1807 1770
  24 1691 1874 1841 1807 1770
  25 1691 1874 1841 1807 1770
  26 1691 1874 1841 1807 1770
  27 1691 1874 1841 1807 1770
  28 1691 1874 1841 1807    0
  29 1691 1874 1841    0    0
  30 1691 1874    0    0    0
  31 1691    0    0    0    0

1.3.1.5 Two time metrics: rounds across ages

table(ds$age,ds$year)
    
     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  16 1770    0    0    0    0    0    0    0    0    0    0    0
  17 1807 1770    0    0    0    0    0    0    0    0    0    0
  18 1841 1807 1770    0    0    0    0    0    0    0    0    0
  19 1874 1841 1807 1770    0    0    0    0    0    0    0    0
  20 1691 1874 1841 1807 1770    0    0    0    0    0    0    0
  21    0 1691 1874 1841 1807 1770    0    0    0    0    0    0
  22    0    0 1691 1874 1841 1807 1770    0    0    0    0    0
  23    0    0    0 1691 1874 1841 1807 1770    0    0    0    0
  24    0    0    0    0 1691 1874 1841 1807 1770    0    0    0
  25    0    0    0    0    0 1691 1874 1841 1807 1770    0    0
  26    0    0    0    0    0    0 1691 1874 1841 1807 1770    0
  27    0    0    0    0    0    0    0 1691 1874 1841 1807 1770
  28    0    0    0    0    0    0    0    0 1691 1874 1841 1807
  29    0    0    0    0    0    0    0    0    0 1691 1874 1841
  30    0    0    0    0    0    0    0    0    0    0 1691 1874
  31    0    0    0    0    0    0    0    0    0    0    0 1691

NSLY97 contains static (bmonth, byear) and dynamic (agemon, ageyear) indicators of age :

ds<- dsL %>% 
  dplyr::filter(id==25, year %in% c(1997:2011)) %>% 
  dplyr::select(id,byear,bmonthF,year,agemon,ageyear)
print(ds)
   id byear bmonthF year agemon ageyear
1  25  1983     Mar 1997    167      13
2  25  1983     Mar 1998    188      15
3  25  1983     Mar 1999    201      16
4  25  1983     Mar 2000    214      17
5  25  1983     Mar 2001    226      18
6  25  1983     Mar 2002    236      19
7  25  1983     Mar 2003    254      21
8  25  1983     Mar 2004    261      21
9  25  1983     Mar 2005    272      22
10 25  1983     Mar 2006    284      23
11 25  1983     Mar 2007    295      24
12 25  1983     Mar 2008    307      25
13 25  1983     Mar 2009    319      26
14 25  1983     Mar 2010    332      27
15 25  1983     Mar 2011    342      28

Variable year is used as cohort indicator. Variable year enumerates NLSY97 rounds, recording the calendaric year during which it took place. When transforming the metric of time, and using biological age instead of year as the temporal dimension, the value of age at the time of the interview will be computed as age = agemon/12

ds<- dsL %>% 
  dplyr::filter(id==25, year %in% c(1997:2011)) %>% 
  dplyr::select(id,bmonthF,byear,year, agemon,ageyear) %>%
  dplyr::mutate (age =  agemon/12)
print(ds)
   id bmonthF byear year agemon ageyear   age
1  25     Mar  1983 1997    167      13 13.92
2  25     Mar  1983 1998    188      15 15.67
3  25     Mar  1983 1999    201      16 16.75
4  25     Mar  1983 2000    214      17 17.83
5  25     Mar  1983 2001    226      18 18.83
6  25     Mar  1983 2002    236      19 19.67
7  25     Mar  1983 2003    254      21 21.17
8  25     Mar  1983 2004    261      21 21.75
9  25     Mar  1983 2005    272      22 22.67
10 25     Mar  1983 2006    284      23 23.67
11 25     Mar  1983 2007    295      24 24.58
12 25     Mar  1983 2008    307      25 25.58
13 25     Mar  1983 2009    319      26 26.58
14 25     Mar  1983 2010    332      27 27.67
15 25     Mar  1983 2011    342      28 28.50

1.4 Attendance

NLSY97 asked to report church attendance (attend) for the past 12 months preceding the interview date. The response card offered a choice of 7 categories ordered by magnitude.

Figure caption test

2 Descriptives

Basic descriptives reports on the selected NLSY97 items
back to Appendix

2.1 Basic demographics

A clean dataset dsL contains data on

dplyr::summarize(dsL,count=n_distinct(id))
  count
1  8983

respondents. Of them one (id = 467) was removed from the dataset due to abberant age score that seemed as a coding mistake. NLSY97 contains representative household sample and the oversample of racial minorities.

ds<- dsL %>% 
  dplyr::group_by(sampleF) %>% 
  dplyr::summarize (count=n_distinct(id))
ds
Source: local data frame [2 x 2]

          sampleF count
1 Cross-Sectional  6747
2      Oversample  2236

plot of chunk basic_demo

2.2 Distribution of age variables

The age of respondents was of particular interest and was entered as a predictor of the model outcome. NSLY97 contains static and dynamic indicators of age age. Variables byear and bmonth were recorded once in 1997 (static) and contain respondents’ birth year and birth month respectively. Two age variables were recorded continuously at each interview (dynamic): age at the time of the interview in months (agemon) and in years (ageyear). Next graph shows how births in the NLSY97 sample was distributed over calendric months from 1980 to 1984:

2.2.1 Months of births

Source: local data frame [10 x 5]
Groups: bmonth, bmonthF

   bmonth bmonthF byear count born
1       1     Jan  1980   159 1980
2       2     Feb  1980   136 1980
3       3     Mar  1980   139 1980
4       4     Apr  1980   125 1980
5       5     May  1980   128 1980
6       6     Jun  1980   137 1980
7       7     Jul  1980   136 1981
8       8     Aug  1980   141 1981
9       9     Sep  1980   144 1981
10     10     Oct  1980   146 1981

plot of chunk bmonth_dist

2.2.2 Age and cohort structure

plot of chunk agemon_dist

2.2.3 Count of respondents in age bins

The amounts of respondents in in the age bin =16 is not big, so results should be interpreted with caution. plot of chunk attend_race_age_binsize

ds<- dsL %.%
  dplyr::mutate(ageF = as.factor(round(agemon/12 ))) %>%
  dplyr::filter(year %in% c(2000:2011), !is.na(attend),ageF %in% c(16,32),race !=3) %.%   
  dplyr::group_by(sampleF,raceF) %.%
  dplyr::summarize(count = n()) 
head(ds, 10)
Source: local data frame [5 x 3]
Groups: sampleF

          sampleF       raceF count
1 Cross-Sectional       Black   187
2 Cross-Sectional    Hispanic   153
3 Cross-Sectional Non-B/Non-H   687
4      Oversample       Black   174
5      Oversample    Hispanic   140

2.3 Counts for data used in models

The following tables inspect the counts of valid responses (church attendance) across two time metircs (age and NLSY97 round) and cohort membership.

ds<- dsL %>%  # chose conditions to apply in creating dataset for modeling
  dplyr::filter(id %in% c(1:9022)) %.% # 1:9022
  dplyr::filter(year %in% c(2000:2011)) %.% # 1997:2011
  dplyr::filter(sample %in% c(0,1)) %.% # 0-Oversample; 1-Cross-Sectional
  dplyr::filter(race %in% c(1:4)) %.% # 1-Black; 2-Hispanis; 3-Mixed; 4-White
  dplyr::filter(byear %in% c(1980:1984)) %>% # birth year 1980:1984
  dplyr::mutate(age=round(agemon/12,0)) # define bin for age
length(unique(ds$id)) # total No. of respondents in dataset
[1] 8983
d<- ds %>% 
  filter(!is.na(attend)) # only nonmissing datapoints
table(d$byear,d$year) # Number of respondent in each NLSY97 round 
      
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  1980 1423 1404 1409 1383 1319 1307 1334 1327 1327 1366 1317 1297
  1981 1617 1578 1582 1562 1477 1416 1502 1462 1472 1502 1501 1497
  1982 1656 1601 1612 1588 1529 1506 1521 1502 1507 1529 1502 1477
  1983 1672 1636 1629 1603 1506 1491 1525 1499 1512 1502 1495 1480
  1984 1650 1630 1632 1578 1484 1463 1488 1473 1459 1460 1455 1440
table(d$byear, d$age) # Number of respondent in each age bin 
      
         16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
  1980    0    0    0    0  702 1315 1549 1350 1414 1245 1456 1290 1406 1343 1342 1327  474
  1981    0    0    0  690 1743 1466 1780 1328 1614 1386 1687 1306 1737 1247 1753  431    0
  1982    0    0  848 1528 1766 1510 1629 1448 1699 1444 1594 1517 1491 1503  553    0    0
  1983    0  724 1815 1519 1767 1376 1681 1449 1713 1329 1765 1250 1752  410    0    0    0
  1984  879 1535 1758 1550 1580 1416 1635 1425 1530 1479 1447 1464  514    0    0    0    0
table(d$age,d$year) # cross-section of age and round of NLSY97
    
     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  16  879    0    0    0    0    0    0    0    0    0    0    0
  17 1495  764    0    0    0    0    0    0    0    0    0    0
  18 1796 1733  892    0    0    0    0    0    0    0    0    0
  19 1498 1489 1490  810    0    0    0    0    0    0    0    0
  20 1629 1697 1764 1656  812    0    0    0    0    0    0    0
  21  721 1356 1430 1499 1333  744    0    0    0    0    0    0
  22    0  810 1618 1706 1669 1555  916    0    0    0    0    0
  23    0    0  669 1341 1373 1398 1366  853    0    0    0    0
  24    0    0    1  702 1520 1568 1667 1602  910    0    0    0
  25    0    0    0    0  608 1248 1360 1376 1361  930    0    0
  26    0    0    0    0    0  670 1513 1603 1651 1595  917    0
  27    0    0    0    0    0    0  548 1244 1360 1398 1351  926
  28    0    0    0    0    0    0    0  585 1489 1637 1605 1584
  29    0    0    0    0    0    0    0    0  506 1270 1393 1334
  30    0    0    0    0    0    0    0    0    0  529 1500 1619
  31    0    0    0    0    0    0    0    0    0    0  504 1254
  32    0    0    0    0    0    0    0    0    0    0    0  474

When subsetting data to run throught the sequence of models, it would of use to see such counts. Insufficient number of observation per cell might be behind poor model estimation. For example, male blacks from the oversample may not be a good subset to fit a complex model.

ds<- dsL %>%  # chose conditions to apply in creating dataset for modeling
  dplyr::filter(id %in% c(1:9022)) %.% # 1:9022
  dplyr::filter(year %in% c(2000:2011)) %.% # 1997:2011
  dplyr::filter(sample %in% c(0)) %.% # 0-Oversample; 1-Cross-Sectional
  dplyr::filter(race %in% c(1)) %.% # 1-Black; 2-Hispanis; 3-Mixed; 4-White
  dplyr::filter(byear %in% c(1980:1984)) %>% # birth year 1980:1984
  dplyr::filter(sex==1) %>% # 1-Male; 2-Female
  dplyr::mutate(age=round(agemon/12,0)) # define bin for age

length(unique(ds$id)) # total No. of respondents in dataset
[1] 632
d<- ds %>% 
  filter(!is.na(attend)) # only nonmissing datapoints
table(d$byear,d$year) # Number of respondent in each NLSY97 round 
      
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  1980   89   89   92   90   80   80   80   82   82   79   79   82
  1981  139  127  137  140  113  112  127  121  121  116  123  122
  1982  127  116  125  121  107  114  115  109  112  111  110  111
  1983  115  112  112  108   96   99  102  100  100   96   93   92
  1984  104   96   99   94   84   92   87   81   88   80   93   91
table(d$byear, d$age) # Number of respondent in each age bin 
      
        16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
  1980   0   0   0   0  51  79 102  87  89  79  79  88  77  87  76  82  28
  1981   0   0   0  62 151 120 159 106 128 114 138 111 138  96 143  32   0
  1982   0   0  57 119 136 116 120 103 128 109 112 116 104 113  45   0   0
  1983   0  47 130 102 118  86 118  92 112  89 116  82 108  25   0   0   0
  1984  48 102  98  99  89  81 104  81  88  84  83  96  36   0   0   0   0
table(d$age,d$year) # cross-section of age and round of NLSY97
    
     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  16   48    0    0    0    0    0    0    0    0    0    0    0
  17  103   46    0    0    0    0    0    0    0    0    0    0
  18  125  112   48    0    0    0    0    0    0    0    0    0
  19  132   99  103   48    0    0    0    0    0    0    0    0
  20  128  141  129  104   43    0    0    0    0    0    0    0
  21   38   94  123  110   77   40    0    0    0    0    0    0
  22    0   48  124  150  119  110   52    0    0    0    0    0
  23    0    0   38  100  103   96   86   46    0    0    0    0
  24    0    0    0   41  106  129  120   96   53    0    0    0
  25    0    0    0    0   32   89  118  102   85   49    0    0
  26    0    0    0    0    0   33  101  129  116   97   52    0
  27    0    0    0    0    0    0   34   92  119  100   93   55
  28    0    0    0    0    0    0    0   28   97  131  104  103
  29    0    0    0    0    0    0    0    0   33   80  117   91
  30    0    0    0    0    0    0    0    0    0   25  104  135
  31    0    0    0    0    0    0    0    0    0    0   28   86
  32    0    0    0    0    0    0    0    0    0    0    0   28

3 Attendance

Mapping church attendance in time
back to Appendix

4 Cross-Sectional View

The focal variable of interest is attend, the item measuring church attendance for the year that preceded the interview date. The questionnaire recorded the responses on the ordinal scale.

Figure caption test

Creating frequency distributions for each of the measurement wave we have:

plot of chunk attend_2000_2011

Here, missing values are used in the calculation of total responses to show the natural attrition in the study. Assuming that attrition is not significantly associated with the outcome measure, we can remove missing values from the calculation of the total and look at prevalence of response endorsements over time.

plot of chunk attend_2000_2011_na

4.1 Change in prevalences

Tracing the rate of change of prevalence in a line graph, we see more clearly which categores increase over time (e.g. “Never”), which decline (e.g. “”About once/week), and which stay relatively stable (e.g. “About twice/month”).

plot of chunk attend_freq_lines

4.2 Prevalence change and race

Inspecting the prevalence trajectories across races.

4.2.1 Attend over race and years

plot of chunk attend_freq_lines_race

4.2.2 Attend over race and ages

Bin includes all respondents who were +/- 6 months from a given age expressed as an intiger.s For example, a 16 year-old is defined as an individual between 15.5 and 16.5 years of age at the time of the interview.

plot of chunk attend_freq_lines_race_ages

4.2.3 Race and attendance for each year

plot of chunk attend_race_2000

4.2.4 Race and attendance for each age bin

plot of chunk attend_race_16

4.3 Counts for data used in models

The following tables inspect the counts of valid responses (church attendance) across two time metircs (age and NLSY97 round) and cohort membership.

dsL<-readRDS("./Data/Derived/dsL.rds")
ds<- dsL %>%  # chose conditions to apply in creating dataset for modeling
  dplyr::filter(id %in% c(1:9022)) %.% # 1:9022
  dplyr::filter(year %in% c(2000:2011)) %.% # 1997:2011
  dplyr::filter(sample %in% c(0,1)) %.% # 0-Oversample; 1-Cross-Sectional
  dplyr::filter(race %in% c(1:4)) %.% # 1-Black; 2-Hispanis; 3-Mixed; 4-White
  dplyr::filter(byear %in% c(1980:1984)) %>% # birth year 1980:1984
  dplyr::mutate(age=year-byear) # define bin for age
length(unique(ds$id)) # total No. of respondents in dataset
[1] 8983
d<- ds %>% 
  filter(is.na(attend)) # only nonmissing datapoints
table(d$byear,d$year) # Number of respondent in each NLSY97 round 
      
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  1980  268  287  282  308  372  384  357  364  364  325  374  394
  1981  257  296  292  312  397  458  372  412  402  372  373  377
  1982  185  240  229  253  312  335  320  339  334  312  339  364
  1983  135  171  178  204  301  316  282  308  295  305  312  327
  1984  120  140  138  192  286  307  282  297  311  310  315  330
table(d$byear, d$age) # Number of respondent in each age bin 
      
        16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
  1980   0   0   0   0 268 287 282 308 372 384 357 364 364 325 374 394
  1981   0   0   0 257 296 292 312 397 458 372 412 402 372 373 377   0
  1982   0   0 185 240 229 253 312 335 320 339 334 312 339 364   0   0
  1983   0 135 171 178 204 301 316 282 308 295 305 312 327   0   0   0
  1984 120 140 138 192 286 307 282 297 311 310 315 330   0   0   0   0
table(d$age,d$year) # cross-section of age and round of NLSY97
    
     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  16  120    0    0    0    0    0    0    0    0    0    0    0
  17  135  140    0    0    0    0    0    0    0    0    0    0
  18  185  171  138    0    0    0    0    0    0    0    0    0
  19  257  240  178  192    0    0    0    0    0    0    0    0
  20  268  296  229  204  286    0    0    0    0    0    0    0
  21    0  287  292  253  301  307    0    0    0    0    0    0
  22    0    0  282  312  312  316  282    0    0    0    0    0
  23    0    0    0  308  397  335  282  297    0    0    0    0
  24    0    0    0    0  372  458  320  308  311    0    0    0
  25    0    0    0    0    0  384  372  339  295  310    0    0
  26    0    0    0    0    0    0  357  412  334  305  315    0
  27    0    0    0    0    0    0    0  364  402  312  312  330
  28    0    0    0    0    0    0    0    0  364  372  339  327
  29    0    0    0    0    0    0    0    0    0  325  373  364
  30    0    0    0    0    0    0    0    0    0    0  374  377
  31    0    0    0    0    0    0    0    0    0    0    0  394
table(d$attend,useNA="always")

 <NA> 
18123 

Such view allows to see quickly whether each cell contains enough observations to offer stability for estimated solutions.

ds<- dsL %>%  # chose conditions to apply in creating dataset for modeling
  dplyr::filter(id %in% c(1:9022)) %.% # 1:9022
  dplyr::filter(year %in% c(2000:2011)) %.% # 1997:2011
  dplyr::filter(sample %in% c(0)) %.% # 0-Oversample; 1-Cross-Sectional
  dplyr::filter(race %in% c(1)) %.% # 1-Black; 2-Hispanis; 3-Mixed; 4-White
  dplyr::filter(byear %in% c(1980:1984)) %>% # birth year 1980:1984
  dplyr::filter(sex==1) %>% # 1-Male; 2-Female
  dplyr::mutate(age=round(agemon/12,0)) # define bin for age
length(unique(ds$id)) # total No. of respondents in dataset
[1] 632
d<- ds %>% 
  filter(!is.na(attend)) # only nonmissing datapoints
table(d$byear,d$year) # Number of respondent in each NLSY97 round 
      
       2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  1980   89   89   92   90   80   80   80   82   82   79   79   82
  1981  139  127  137  140  113  112  127  121  121  116  123  122
  1982  127  116  125  121  107  114  115  109  112  111  110  111
  1983  115  112  112  108   96   99  102  100  100   96   93   92
  1984  104   96   99   94   84   92   87   81   88   80   93   91
table(d$byear, d$age) # Number of respondent in each age bin 
      
        16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
  1980   0   0   0   0  51  79 102  87  89  79  79  88  77  87  76  82  28
  1981   0   0   0  62 151 120 159 106 128 114 138 111 138  96 143  32   0
  1982   0   0  57 119 136 116 120 103 128 109 112 116 104 113  45   0   0
  1983   0  47 130 102 118  86 118  92 112  89 116  82 108  25   0   0   0
  1984  48 102  98  99  89  81 104  81  88  84  83  96  36   0   0   0   0
table(d$age,d$year) # cross-section of age and round of NLSY97
    
     2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
  16   48    0    0    0    0    0    0    0    0    0    0    0
  17  103   46    0    0    0    0    0    0    0    0    0    0
  18  125  112   48    0    0    0    0    0    0    0    0    0
  19  132   99  103   48    0    0    0    0    0    0    0    0
  20  128  141  129  104   43    0    0    0    0    0    0    0
  21   38   94  123  110   77   40    0    0    0    0    0    0
  22    0   48  124  150  119  110   52    0    0    0    0    0
  23    0    0   38  100  103   96   86   46    0    0    0    0
  24    0    0    0   41  106  129  120   96   53    0    0    0
  25    0    0    0    0   32   89  118  102   85   49    0    0
  26    0    0    0    0    0   33  101  129  116   97   52    0
  27    0    0    0    0    0    0   34   92  119  100   93   55
  28    0    0    0    0    0    0    0   28   97  131  104  103
  29    0    0    0    0    0    0    0    0   33   80  117   91
  30    0    0    0    0    0    0    0    0    0   25  104  135
  31    0    0    0    0    0    0    0    0    0    0   28   86
  32    0    0    0    0    0    0    0    0    0    0    0   28

5 Longitudinal View

Graphs above shows change in the cross-sectional distribution of responses over the years. Modeling the change in these response frequencies is handled well by Markov models. LCM, however, works with longitudinal data, modeling the trajectory of each individual and treating attendance as a continuous outcome.

To demonstrate mapping of individual trajectories to time, let’s select a dataset that would include personal identifyer (id), cohort indicator (byear), wave of measurement (year) and the focal variable of interest - worship attendance (attend).

ds<- dsL %>%  dplyr::filter(year %in% c(2000:2011), id==47) %>%
              dplyr:: select(id, byear, year, attend, attendF)
print(ds)
   id byear year attend              attendF
1  47  1982 2000      5    About twice/month
2  47  1982 2001      2        Once or Twice
3  47  1982 2002      4     About once/month
4  47  1982 2003      2        Once or Twice
5  47  1982 2004      3 Less than once/month
6  47  1982 2005      2        Once or Twice
7  47  1982 2006      2        Once or Twice
8  47  1982 2007      3 Less than once/month
9  47  1982 2008      2        Once or Twice
10 47  1982 2009      1                Never
11 47  1982 2010      1                Never
12 47  1982 2011      1                Never

The view above lists attendance data for subjust with id = 47. Mapping his attendance to time we have .

plot of chunk attend_line_1id

where vertical dimension maps the outcome value and the horizontal maps the time. There will be a trajecory for each of the respondents. Each of such trajectories imply a story, a life scenario. Why one person grows in his religious involvement, while other declines, or never develops an interest in the first place? To demostrate how interpretations of trajectories can vary among individuals consider the following example.

5.1 Attendance over waves

Attendance trajectories of subjects with ids 4, 25, 35, and 47 are plotted in the next graph

plot of chunk attend_line_4id_years

The respondent id = 35 reported attending no worship services in any of the years, while respodent id = 25 seemed to frequent it, indicating weekly attendance in 8 out of the 12 years. Individual id = 47 started as a fairly regular attendee of religious services in 2000 (5 = “about twice a month”), then gradually declined his involvement to nill in 2009 and on. Respondent id = 4, on the other hand started off with a rather passive involvement, reporting attended church only “Once or twice” in 2000, maintained a low level of participation throughout the years, only to surge his attendance in 2011. Latent curve models will describe intraindividual trajectories of change, while summarizinig the interindividual similarities and trends.

5.1.1 Changing the metric of time

Previous research in religiousity indicated that age might be one of the primary factors explaining interindividual differences in church attendance. To examine the role of age, we change the metric of time from waves of measurement, as in the previous graph, to biological age. Consult Metrics report for details on measurement of age.

ds<- dsL %>% dplyr::filter(id %in% c(4,25,35,47),year %in% c(2000:2011)) %>% 
  dplyr::select(idF,byear,bmonth,year,ageyear,agemon) %.%
  dplyr::mutate(time=year-2000, age=agemon/12)
print(ds[ds$idF==25,])
   idF byear bmonth year ageyear agemon time   age
13  25  1983      3 2000      17    214    0 17.83
14  25  1983      3 2001      18    226    1 18.83
15  25  1983      3 2002      19    236    2 19.67
16  25  1983      3 2003      21    254    3 21.17
17  25  1983      3 2004      21    261    4 21.75
18  25  1983      3 2005      22    272    5 22.67
19  25  1983      3 2006      23    284    6 23.67
20  25  1983      3 2007      24    295    7 24.58
21  25  1983      3 2008      25    307    8 25.58
22  25  1983      3 2009      26    319    9 26.58
23  25  1983      3 2010      27    332   10 27.67
24  25  1983      3 2011      28    342   11 28.50

5.2 Attendance over ages

Plotting individual trajectories, with age as the metric of time.

plot of chunk attend_line_4id_age3

6 Data Manipulation Guide

Demonstrating the language of data manipulation in dplyr packages using dsL as an example
back to Appendix

6.1 Five basic functions in data handling

For a more detailed discussion of basic verbs and operations consult the [R-Studio guide][1] or internal [vignette][2]

vignette("introduction",package="dplyr")

The following is a brief demonstration of dplyr syntax using dsL dataset as an example. I attach prefix dplyr:: to avoid possible conflicts with plyr package on which ggplot2 package relies. I recommend such practice in all dplyr expressions in sharable publications.

One of the innovations in dplyr is the ability to chain phrases in the data manipulationsentence. The operator %>% (or %.%), accomplishes this, turning x %>% f(y) into f(x, y) .

6.1.1 select()

selects variables into a smaller data set

ds<-dsL
dim(ds)
[1] 134745     60
ds<- dsL %>%
  dplyr::select(id,year, byear, attend, attendF)
head(ds,13)
   id year byear attend         attendF
1   1 1997  1981     NA            <NA>
2   1 1998  1981     NA            <NA>
3   1 1999  1981     NA            <NA>
4   1 2000  1981      1           Never
5   1 2001  1981      6 About once/week
6   1 2002  1981      2   Once or Twice
7   1 2003  1981      1           Never
8   1 2004  1981      1           Never
9   1 2005  1981      1           Never
10  1 2006  1981      1           Never
11  1 2007  1981      1           Never
12  1 2008  1981      1           Never
13  1 2009  1981      1           Never
dim(ds)
[1] 134745      5

6.1.2 filter()

Removes observations that do not meet criteria. The following code selects observation based on the type of sample

  sample         sampleF
1      1 Cross-Sectional
2      0      Oversample

and only between years 2000 and 2011, as only during those years the outcome of interest attend was recorded.

ds<- dsL %>%
  dplyr::filter(sample==1, year %in% c(2000:2011))%>%
  dplyr::select(id, year, attend, attendF)
head(ds,13)
   id year attend         attendF
1   1 2000      1           Never
2   1 2001      6 About once/week
3   1 2002      2   Once or Twice
4   1 2003      1           Never
5   1 2004      1           Never
6   1 2005      1           Never
7   1 2006      1           Never
8   1 2007      1           Never
9   1 2008      1           Never
10  1 2009      1           Never
11  1 2010      1           Never
12  1 2011      1           Never
13  2 2000      2   Once or Twice

6.1.3 arrange()

Sorts observations

ds<- dsL %>%
  dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
  dplyr::select(id, year, attend) %>%
  dplyr::arrange(year, desc(id))
head(ds,13)
     id year attend
1  9022 2000      1
2  9021 2000      2
3  9020 2000      2
4  9018 2000      4
5  9017 2000      6
6  9012 2000      5
7  9011 2000      6
8  9010 2000      1
9  9009 2000      2
10 9008 2000      6
11 8992 2000     NA
12 8991 2000      3
13 8987 2000      6
ds<- dplyr::arrange(ds, id, year)
head(ds, 13)
   id year attend
1   1 2000      1
2   1 2001      6
3   1 2002      2
4   1 2003      1
5   1 2004      1
6   1 2005      1
7   1 2006      1
8   1 2007      1
9   1 2008      1
10  1 2009      1
11  1 2010      1
12  1 2011      1
13  2 2000      2

6.1.4 mutate()

Creates additional variables from the values of existing.

ds<- dsL %>%
  dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
  dplyr::select(id, byear, year, attend) %>%
  dplyr::mutate(age = year-byear, 
                timec = year-2000,
                linear= timec,
                quadratic= linear^2,
                cubic= linear^3)
head(ds,13)
   id byear year attend age timec linear quadratic cubic
1   1  1981 2000      1  19     0      0         0     0
2   1  1981 2001      6  20     1      1         1     1
3   1  1981 2002      2  21     2      2         4     8
4   1  1981 2003      1  22     3      3         9    27
5   1  1981 2004      1  23     4      4        16    64
6   1  1981 2005      1  24     5      5        25   125
7   1  1981 2006      1  25     6      6        36   216
8   1  1981 2007      1  26     7      7        49   343
9   1  1981 2008      1  27     8      8        64   512
10  1  1981 2009      1  28     9      9        81   729
11  1  1981 2010      1  29    10     10       100  1000
12  1  1981 2011      1  30    11     11       121  1331
13  2  1982 2000      2  18     0      0         0     0

6.1.5 summarize()

collapses data into a single value computed according to the aggregate functions.

require(dplyr)
ds<- dsL %>%
  dplyr::filter(sample==1) %>%
  dplyr::summarize(N= n_distinct(id))
ds
     N
1 6747

Other functions one could use with summarize() include:

From base

  • min()
  • max()
  • mean()
  • sum()
  • sd()
  • median()
  • IQR()

Native to dplyr

  • n() - number of observations in the current group
  • n_distinct(x) - count the number of unique values in x.
  • first(x) - similar to x[ 1 ] + control over NA
  • last(x) - similar to x[length(x)] + control over NA
  • nth(x, n) - similar to x[n] + control over NA

6.2 Grouping and Combining

The function group_by() is used to identify groups in split-apply-combine (SAC) procedure: it splits the initial data into smaller datasets (according to all possible interactions between the levels of supplied variables). It is these smaller datasets that summarize() will individually collapse into a single computed value according to its formula.

ds<- dsL %>%
  dplyr::filter(sample==1, year %in% c(2000:2011)) %>%
  dplyr::select(id, year, attendF) %>%
  dplyr::group_by(year,attendF) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: year

   year              attendF count total  percent
1  2000                Never  1580  6747 0.234178
2  2000        Once or Twice  1304  6747 0.193271
3  2000 Less than once/month   775  6747 0.114866
4  2000     About once/month   362  6747 0.053653
5  2000    About twice/month   393  6747 0.058248
6  2000      About once/week  1101  6747 0.163184
7  2000   Several times/week   463  6747 0.068623
8  2000             Everyday    36  6747 0.005336
9  2000                   NA   733  6747 0.108641
10 2001                Never  1626  6747 0.240996

To verify :

dplyr::summarize( filter(ds, year==2000), should.be.one=sum(percent))
Source: local data frame [1 x 2]

  year should.be.one
1 2000             1

6.3 Base subsetting

Generally, we can compose any desired dataset by using matrix calls. The general formula is of the form: ds[ rowCond , colCond ], where ds is a dataframe, and rowCond and colCond are conditions for including rows and columns of the new dataset, respectively. One can also call a variable by attaching $ followed variable name to the name of the dataset: ds$variableName.

ds<-dsL[dsL$year %in% c(2000:2011),c('id',"byear","year","agemon","attendF","ageyearF")]
print(ds[ds$id==1,]) 
   id byear year agemon         attendF ageyearF
4   1  1981 2000    231           Never       19
5   1  1981 2001    243 About once/week       20
6   1  1981 2002    256   Once or Twice       21
7   1  1981 2003    266           Never       22
8   1  1981 2004    279           Never       23
9   1  1981 2005    290           Never       24
10  1  1981 2006    302           Never       25
11  1  1981 2007    313           Never       26
12  1  1981 2008    325           Never       27
13  1  1981 2009    337           Never       28
14  1  1981 2010    350           Never       29
15  1  1981 2011    360           Never       29
The following is a list of operatiors that can be used in these calls.
  • basic math operators: +, -, *, /, %%, ^
  • math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
  • logical comparisons: <, <=, !=, >=, >, ==, %in%
  • boolean operations: &, &&, |, ||, !, xor
  • basic aggregations: mean, sum, min, max, sd, var

dplyr can translate all of these into SQL. For more of on dplyr and SQL compatibility consult another built-in [vignette][3]

vignette("database",package="dplyr")

6.4 Base Reference

The following unary and binary operators are defined for base. They are listed in precedence groups, from highest to lowest.

  • :: ::: - access variables in a namespace
  • $ @ - component / slot extraction
  • [ [[ - indexing
  • ^ - exponentiation (right to left)
  • - + - unary minus and plus
  • : - sequence operator
  • %any% - special operators (including %% and %/%)
  • * / - multiply, divide
  • + - - (binary) add, subtract
  • < > <= >= == != - ordering and comparison
  • ! - negation
  • & && - and
  • | || - or
  • ~ - as in formulae
  • -> ->> - rightwards assignment
  • <- <<- - assignment (right to left)
  • = - assignment (right to left)
  • ? - help (unary and binary)

6.5 Joining data with dplyr

While merge works just fine , joining data frames with dplyr might offer some additional conveniences:

  • rows are kept in existing order
  • much faster
  • tells you what keys you’re merging by (if you don’t supply)
  • also work with database tables.

dplyr implements the four most useful joins from SQL:

  • inner_join - similar to merge(…, all.x=F, all.y=F)
  • ileft_join - similar to merge(…, all.x=T, all.y=F)
  • isemi_join - no equivalent in merge() unless y includes only join fields
  • ianti_join - no equivalent in merge(), this is all x without a match in y

7 lmer syntax and S4 architecture

Locating model estimation results in S4 objects produced by of lmer() calls of lme4 package. back to Appendix

7.1 Basics

Prepare data for modeling. Only the first 200 respondents will be selected to keep illustration light.

dsL<-readRDS("./Data/Derived/dsL.rds")
ds<- dsL %>% dplyr::filter(id %in% c(1:200),year %in% c(2000:2011)) %>% 
  dplyr::mutate(timec=year-2000, timec2= timec^2, timec3= timec^3, 
                agec= round( (agemon/12),0)-16) %>% 
  dplyr::select(id,year,attend, timec,timec2, timec3, agec)
head(ds, 20)
   id year attend timec timec2 timec3 agec
1   1 2000      1     0      0      0    3
2   1 2001      6     1      1      1    4
3   1 2002      2     2      4      8    5
4   1 2003      1     3      9     27    6
5   1 2004      1     4     16     64    7
6   1 2005      1     5     25    125    8
7   1 2006      1     6     36    216    9
8   1 2007      1     7     49    343   10
9   1 2008      1     8     64    512   11
10  1 2009      1     9     81    729   12
11  1 2010      1    10    100   1000   13
12  1 2011      1    11    121   1331   14
13  2 2000      2     0      0      0    2
14  2 2001      2     1      1      1    3
15  2 2002      1     2      4      8    4
16  2 2003      1     3      9     27    5
17  2 2004      2     4     16     64    6
18  2 2005      2     5     25    125    8
19  2 2006     NA     6     36    216   NA
20  2 2007     NA     7     49    343   NA

Fit the model with lmer

require(lme4)
Loading required package: lme4
Loading required package: Matrix
Loading required package: Rcpp
m10 <-lmer (attend ~ 
               1  + agec + timec + timec2 + timec3
             + agec:timec +agec:timec2 + agec:timec3
             + (1 + timec + timec2 + timec3 | id),
             data = ds, REML=0)
Warning: convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded
Warning: Model failed to converge with max|grad| = 680.408 (tol = 0.002)
Warning: the condition has length > 1 and only the first element will be used
Warning: Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?
model<- m10

Print the basic results of the fitted model

print(model) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: attend ~ 1 + agec + timec + timec2 + timec3 + agec:timec + agec:timec2 +  
    agec:timec3 + (1 + timec + timec2 + timec3 | id)
   Data: ds
     AIC      BIC   logLik deviance df.resid 
    6490     6595    -3226     6452     1854 
Random effects:
 Groups   Name        Std.Dev. Corr             
 id       (Intercept) 1.23574                   
          timec       0.82117  -0.12            
          timec2      0.16635   0.00 -0.95      
          timec3      0.00906   0.03  0.90 -0.98
 Residual             1.06840                   
Number of obs: 1873, groups: id, 191
Fixed Effects:
(Intercept)         agec        timec       timec2       timec3   agec:timec  agec:timec2  agec:timec3  
   3.37e+00    -1.33e-01    -1.32e-01    -1.95e-02     7.12e-04     3.83e-02    -1.41e-03     2.02e-05  

Or get a more detailed summary

summary(model)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: attend ~ 1 + agec + timec + timec2 + timec3 + agec:timec + agec:timec2 +  
    agec:timec3 + (1 + timec + timec2 + timec3 | id)
   Data: ds

     AIC      BIC   logLik deviance df.resid 
    6490     6595    -3226     6452     1854 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-3.559 -0.442 -0.097  0.375  4.992 

Random effects:
 Groups   Name        Variance Std.Dev. Corr             
 id       (Intercept) 1.53e+00 1.23574                   
          timec       6.74e-01 0.82117  -0.12            
          timec2      2.77e-02 0.16635   0.00 -0.95      
          timec3      8.21e-05 0.00906   0.03  0.90 -0.98
 Residual             1.14e+00 1.06840                   
Number of obs: 1873, groups: id, 191

Fixed effects:
             Estimate Std. Error t value
(Intercept)  3.37e+00   2.22e-01   15.16
agec        -1.33e-01   7.49e-02   -1.78
timec       -1.32e-01   1.64e-01   -0.80
timec2      -1.95e-02   4.20e-02   -0.46
timec3       7.12e-04   3.21e-03    0.22
agec:timec   3.83e-02   3.26e-02    1.17
agec:timec2 -1.41e-03   5.00e-03   -0.28
agec:timec3  2.02e-05   2.64e-04    0.08

Correlation of Fixed Effects:
            (Intr) agec   timec  timec2 timec3 agc:tm agc:t2
agec        -0.843                                          
timec       -0.085 -0.147                                   
timec2      -0.008  0.198 -0.698                            
timec3       0.038 -0.186  0.455 -0.927                     
agec:timec   0.490 -0.504 -0.558  0.015  0.199              
agec:timec2 -0.338  0.275  0.756 -0.449  0.215 -0.865       
agec:timec3  0.223 -0.121 -0.739  0.745 -0.617  0.573 -0.886

For a list of objects that can be extracted from the model

str(summary(model))
List of 16
 $ methTitle   : chr "Linear mixed model fit by maximum likelihood "
 $ objClass    : atomic [1:1] lmerMod
  ..- attr(*, "package")= chr "lme4"
 $ devcomp     :List of 2
  ..$ cmp : Named num [1:10] 888.9 83.4 1631.4 506.6 2138 ...
  .. ..- attr(*, "names")= chr [1:10] "ldL2" "ldRX2" "wrss" "ussq" ...
  ..$ dims: Named int [1:12] 1873 1873 8 1865 10 764 1 1 0 0 ...
  .. ..- attr(*, "names")= chr [1:12] "N" "n" "p" "nmp" ...
 $ isLmer      : logi TRUE
 $ useScale    : logi TRUE
 $ logLik      :Class 'logLik' : -3226 (df=19)
 $ family      : NULL
 $ link        : NULL
 $ ngrps       : Named num 191
  ..- attr(*, "names")= chr "id"
 $ coefficients: num [1:8, 1:3] 3.365265 -0.133152 -0.13205 -0.019508 0.000712 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. ..$ : chr [1:3] "Estimate" "Std. Error" "t value"
 $ sigma       : num 1.07
 $ vcov        :Formal class 'dpoMatrix' [package "Matrix"] with 5 slots
  .. ..@ x       : num [1:64] 4.93e-02 -1.40e-02 -3.09e-03 -7.35e-05 2.72e-05 ...
  .. ..@ Dim     : int [1:2] 8 8
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. ..@ uplo    : chr "U"
  .. ..@ factors :List of 2
  .. .. ..$ Cholesky   :Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. .. ..@ x       : num [1:64] 0.222 0 0 0 0 ...
  .. .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. ..@ uplo    : chr "U"
  .. .. .. .. ..@ diag    : chr "N"
  .. .. ..$ correlation:Formal class 'corMatrix' [package "Matrix"] with 6 slots
  .. .. .. .. ..@ sd      : num [1:8] 0.22206 0.07489 0.16415 0.04198 0.00321 ...
  .. .. .. .. ..@ x       : num [1:64] 1 -0.84315 -0.0847 -0.00788 0.03823 ...
  .. .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. .. .. .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. .. .. .. ..@ uplo    : chr "U"
  .. .. .. .. ..@ factors :List of 1
  .. .. .. .. .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. .. .. .. .. ..@ x       : num [1:64] 1 0 0 0 0 ...
  .. .. .. .. .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. .. ..@ uplo    : chr "U"
  .. .. .. .. .. .. .. ..@ diag    : chr "N"
 $ varcor      :List of 1
  ..$ id: num [1:4, 1:4] 1.527062 -0.119554 0.00028 0.000291 -0.119554 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. ..- attr(*, "stddev")= Named num [1:4] 1.23574 0.82117 0.16635 0.00906
  .. .. ..- attr(*, "names")= chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. ..- attr(*, "correlation")= num [1:4, 1:4] 1 -0.11782 0.00136 0.02598 -0.11782 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  ..- attr(*, "sc")= num 1.07
  ..- attr(*, "useSc")= logi TRUE
  ..- attr(*, "class")= chr "VarCorr.merMod"
 $ AICtab      : Named num [1:5] 6490 6595 -3226 6452 1854
  ..- attr(*, "names")= chr [1:5] "AIC" "BIC" "logLik" "deviance" ...
 $ call        : language lmer(formula = attend ~ 1 + agec + timec + timec2 + timec3 + agec:timec + agec:timec2 + agec:timec3 + (1 + timec +      timec2 + timec3 | id), data = ds, REML = 0)
 $ residuals   : Named num [1:1873] -1.653 3.34 -0.111 -0.781 -0.545 ...
  ..- attr(*, "names")= chr [1:1873] "1" "2" "3" "4" ...
 - attr(*, "class")= chr "summary.merMod"

There a number of ways one could extract the needed element from the S4 object. In addition, some of the elements might be stored in several locations.

7.2 Model formula

model@call # 1
lmer(formula = attend ~ 1 + agec + timec + timec2 + timec3 + 
    agec:timec + agec:timec2 + agec:timec3 + (1 + timec + timec2 + 
    timec3 | id), data = ds, REML = 0)
(summary(model))$call #2
lmer(formula = attend ~ 1 + agec + timec + timec2 + timec3 + 
    agec:timec + agec:timec2 + agec:timec3 + (1 + timec + timec2 + 
    timec3 | id), data = ds, REML = 0)

7.3 Fit and Information indices

# get indicies
mInfo<-summary(model)$AICtab
mInfo
     AIC      BIC   logLik deviance df.resid 
    6490     6595    -3226     6452     1854 
mInfo['logLik']
logLik 
 -3226 

Aleternatively,

logLik<-logLik(model)
dev<-deviance(model)
AIC <- AIC(model) 
BIC <- BIC(model)
N<- model@devcomp$dims["N"] # Looks like the total number of data points ( individual-by-time)
p<- model@devcomp$dims["p"] # Looks like the number of estimated parameters, verify
ids<- (summary(model))$ngrps # number of units on level-2, here: individuals
mInfo<- c("logLik"=logLik,"dev"=dev,"AIC"=AIC,"BIC"=BIC,"N"=N, "p"=p, "ids"=ids)
mInfo
logLik    dev    AIC    BIC    N.N    p.p ids.id 
 -3226   6452   6490   6595   1873      8    191 

7.4 Random Effects (RE)

7.4.1 Matrix of RE

str(summary(model)$varcor)
List of 1
 $ id: num [1:4, 1:4] 1.527062 -0.119554 0.00028 0.000291 -0.119554 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  ..- attr(*, "stddev")= Named num [1:4] 1.23574 0.82117 0.16635 0.00906
  .. ..- attr(*, "names")= chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  ..- attr(*, "correlation")= num [1:4, 1:4] 1 -0.11782 0.00136 0.02598 -0.11782 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
  .. .. ..$ : chr [1:4] "(Intercept)" "timec" "timec2" "timec3"
 - attr(*, "sc")= num 1.07
 - attr(*, "useSc")= logi TRUE
 - attr(*, "class")= chr "VarCorr.merMod"
# extract RE covariance matrix
mRE<- data.frame(summary(model)$varcor$id) 
mRE
            X.Intercept.     timec     timec2     timec3
(Intercept)    1.5270624 -0.119554  0.0002804  2.908e-04
timec         -0.1195542  0.674320 -0.1302896  6.658e-03
timec2         0.0002804 -0.130290  0.0276723 -1.483e-03
timec3         0.0002908  0.006658 -0.0014833  8.205e-05
mRE<- data.frame(VarCorr(model)$id)
mRE
            X.Intercept.     timec     timec2     timec3
(Intercept)    1.5270624 -0.119554  0.0002804  2.908e-04
timec         -0.1195542  0.674320 -0.1302896  6.658e-03
timec2         0.0002804 -0.130290  0.0276723 -1.483e-03
timec3         0.0002908  0.006658 -0.0014833  8.205e-05
mREcor<-  data.frame(attr(summary(model)$varcor$id,"correlation")) # corrleation 
mREcor
            X.Intercept.   timec    timec2   timec3
(Intercept)     1.000000 -0.1178  0.001364  0.02598
timec          -0.117816  1.0000 -0.953793  0.89509
timec2          0.001364 -0.9538  1.000000 -0.98438
timec3          0.025983  0.8951 -0.984383  1.00000
mREcor<- data.frame( attr(VarCorr(model)$id,"correlation"))
mREcor
            X.Intercept.   timec    timec2   timec3
(Intercept)     1.000000 -0.1178  0.001364  0.02598
timec          -0.117816  1.0000 -0.953793  0.89509
timec2          0.001364 -0.9538  1.000000 -0.98438
timec3          0.025983  0.8951 -0.984383  1.00000
mREsd<-   data.frame(STD=(attr(summary(model)$varcor$id,"stddev"))) 
mREsd
                 STD
(Intercept) 1.235744
timec       0.821170
timec2      0.166350
timec3      0.009058

7.4.2 extracting RE for each individual

RE<- lme4:::ranef.merMod(model)$id 
head(RE,5)
  (Intercept)   timec    timec2    timec3
1     -0.1998 -0.1948 -0.005072  0.001247
2     -0.8681 -0.2208  0.064432 -0.003808
3     -0.1842 -0.6816  0.197827 -0.011676
4     -0.7941  0.1096 -0.031264  0.002802
5      0.1088 -0.4967  0.096427 -0.004834
tail(RE,5)
    (Intercept)   timec    timec2     timec3
196    -1.54368 -0.1687  0.054824 -0.0030731
197    -0.09957 -0.0621  0.009258 -0.0005129
198    -0.97355  0.2022 -0.046348  0.0027714
199    -1.40829 -0.1739  0.040914 -0.0022034
200     2.42073  0.5940 -0.104779  0.0053983
mRE
            X.Intercept.     timec     timec2     timec3
(Intercept)    1.5270624 -0.119554  0.0002804  2.908e-04
timec         -0.1195542  0.674320 -0.1302896  6.658e-03
timec2         0.0002804 -0.130290  0.0276723 -1.483e-03
timec3         0.0002908  0.006658 -0.0014833  8.205e-05
cor(RE)  # not the same as mRE, find out why
            (Intercept)   timec  timec2  timec3
(Intercept)      1.0000  0.0319 -0.1996  0.2342
timec            0.0319  1.0000 -0.9257  0.8486
timec2          -0.1996 -0.9257  1.0000 -0.9817
timec3           0.2342  0.8486 -0.9817  1.0000
var(RE)  # not the same as mRE, find out why
            (Intercept)     timec     timec2     timec3
(Intercept)    1.487037  0.019434 -0.0233827  0.0014309
timec          0.019434  0.249575 -0.0444308  0.0021240
timec2        -0.023383 -0.044431  0.0092298 -0.0004726
timec3         0.001431  0.002124 -0.0004726  0.0000251

7.5 Fixed Effects (RE)

7.5.1 Matrix of FE

str(summary(model)$vcov)
Formal class 'dpoMatrix' [package "Matrix"] with 5 slots
  ..@ x       : num [1:64] 4.93e-02 -1.40e-02 -3.09e-03 -7.35e-05 2.72e-05 ...
  ..@ Dim     : int [1:2] 8 8
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  ..@ uplo    : chr "U"
  ..@ factors :List of 2
  .. ..$ Cholesky   :Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. ..@ x       : num [1:64] 0.222 0 0 0 0 ...
  .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. ..$ : NULL
  .. .. .. .. ..$ : NULL
  .. .. .. ..@ uplo    : chr "U"
  .. .. .. ..@ diag    : chr "N"
  .. ..$ correlation:Formal class 'corMatrix' [package "Matrix"] with 6 slots
  .. .. .. ..@ sd      : num [1:8] 0.22206 0.07489 0.16415 0.04198 0.00321 ...
  .. .. .. ..@ x       : num [1:64] 1 -0.84315 -0.0847 -0.00788 0.03823 ...
  .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. .. .. .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. .. .. ..@ uplo    : chr "U"
  .. .. .. ..@ factors :List of 1
  .. .. .. .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. .. .. .. ..@ x       : num [1:64] 1 0 0 0 0 ...
  .. .. .. .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. .. ..@ uplo    : chr "U"
  .. .. .. .. .. .. ..@ diag    : chr "N"
mFE<- (summary(model)$vcov@factors$correlation) # notice that this is object of class corMatrix
str(mFE)
Formal class 'corMatrix' [package "Matrix"] with 6 slots
  ..@ sd      : num [1:8] 0.22206 0.07489 0.16415 0.04198 0.00321 ...
  ..@ x       : num [1:64] 1 -0.84315 -0.0847 -0.00788 0.03823 ...
  ..@ Dim     : int [1:2] 8 8
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  .. ..$ : chr [1:8] "(Intercept)" "agec" "timec" "timec2" ...
  ..@ uplo    : chr "U"
  ..@ factors :List of 1
  .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. ..@ x       : num [1:64] 1 0 0 0 0 ...
  .. .. .. ..@ Dim     : int [1:2] 8 8
  .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. ..$ : NULL
  .. .. .. .. ..$ : NULL
  .. .. .. ..@ uplo    : chr "U"
  .. .. .. ..@ diag    : chr "N"

7.5.2 estimate of the FE

# similar ways to extract FE estimates, #3 is the fullest
FE1<- fixef(model) #1 
FE2<- getME(model, "beta") # 2
FE3<- summary(model)$coefficients # 3
coefs<- FE1

7.6 Prediction and Residuals

7.6.1 Restatement of model input

First, recover information that went into the model

# dsP - P for predicted
dsp<-data.frame(model@frame) # original vars used by the model (no interaction terms)
head(dsp,13)
   attend agec timec timec2 timec3 id
1       1    3     0      0      0  1
2       6    4     1      1      1  1
3       2    5     2      4      8  1
4       1    6     3      9     27  1
5       1    7     4     16     64  1
6       1    8     5     25    125  1
7       1    9     6     36    216  1
8       1   10     7     49    343  1
9       1   11     8     64    512  1
10      1   12     9     81    729  1
11      1   13    10    100   1000  1
12      1   14    11    121   1331  1
13      2    2     0      0      0  2

Another way, which includes interaction terms, but no outcome

dsp<- data.frame(getME(model,"X")) # no Y, only predictors (with interaction terms)
head(dsp,13)
   X.Intercept. agec timec timec2 timec3 agec.timec agec.timec2 agec.timec3
1             1    3     0      0      0          0           0           0
2             1    4     1      1      1          4           4           4
3             1    5     2      4      8         10          20          40
4             1    6     3      9     27         18          54         162
5             1    7     4     16     64         28         112         448
6             1    8     5     25    125         40         200        1000
7             1    9     6     36    216         54         324        1944
8             1   10     7     49    343         70         490        3430
9             1   11     8     64    512         88         704        5632
10            1   12     9     81    729        108         972        8748
11            1   13    10    100   1000        130        1300       13000
12            1   14    11    121   1331        154        1694       18634
13            1    2     0      0      0          0           0           0

We can add response vector and the grouping factor for the second level (individual)

dsp$id<-getME(model,"flist")$id # first level grouping factor, individual
dsp$y<-getME(model,"y") # response vector
head(dsp,13)
   X.Intercept. agec timec timec2 timec3 agec.timec agec.timec2 agec.timec3 id y
1             1    3     0      0      0          0           0           0  1 1
2             1    4     1      1      1          4           4           4  1 6
3             1    5     2      4      8         10          20          40  1 2
4             1    6     3      9     27         18          54         162  1 1
5             1    7     4     16     64         28         112         448  1 1
6             1    8     5     25    125         40         200        1000  1 1
7             1    9     6     36    216         54         324        1944  1 1
8             1   10     7     49    343         70         490        3430  1 1
9             1   11     8     64    512         88         704        5632  1 1
10            1   12     9     81    729        108         972        8748  1 1
11            1   13    10    100   1000        130        1300       13000  1 1
12            1   14    11    121   1331        154        1694       18634  1 1
13            1    2     0      0      0          0           0           0  2 2

7.6.2 Adding model output

There are several way to extract the predictions made by the model

# model outcome, predicted values
yHat1<- fitted(model) # 1
yHat2<- predict(model) # 2
yHat3<- getME(model,"mu") # 3

identical(yHat1,yHat2)
[1] TRUE
identical(as.numeric(yHat2),yHat3)
[1] TRUE
dsp$yHat<- predict(model)
head(dsp,13)
   X.Intercept. agec timec timec2 timec3 agec.timec agec.timec2 agec.timec3 id y  yHat
1             1    3     0      0      0          0           0           0  1 1 2.766
2             1    4     1      1      1          4           4           4  1 6 2.431
3             1    5     2      4      8         10          20          40  1 2 2.119
4             1    6     3      9     27         18          54         162  1 1 1.834
5             1    7     4     16     64         28         112         448  1 1 1.582
6             1    8     5     25    125         40         200        1000  1 1 1.367
7             1    9     6     36    216         54         324        1944  1 1 1.196
8             1   10     7     49    343         70         490        3430  1 1 1.074
9             1   11     8     64    512         88         704        5632  1 1 1.009
10            1   12     9     81    729        108         972        8748  1 1 1.007
11            1   13    10    100   1000        130        1300       13000  1 1 1.078
12            1   14    11    121   1331        154        1694       18634  1 1 1.228
13            1    2     0      0      0          0           0           0  2 2 2.231

7.6.3 Adding residual

dsp$resid<- lme4:::residuals.merMod(model) # individual residual (id and time)
head(dsp,13)
   X.Intercept. agec timec timec2 timec3 agec.timec agec.timec2 agec.timec3 id y  yHat     resid
1             1    3     0      0      0          0           0           0  1 1 2.766 -1.766010
2             1    4     1      1      1          4           4           4  1 6 2.431  3.568956
3             1    5     2      4      8         10          20          40  1 2 2.119 -0.119025
4             1    6     3      9     27         18          54         162  1 1 1.834 -0.834349
5             1    7     4     16     64         28         112         448  1 1 1.582 -0.581892
6             1    8     5     25    125         40         200        1000  1 1 1.367 -0.367016
7             1    9     6     36    216         54         324        1944  1 1 1.196 -0.195566
8             1   10     7     49    343         70         490        3430  1 1 1.074 -0.073873
9             1   11     8     64    512         88         704        5632  1 1 1.009 -0.008749
10            1   12     9     81    729        108         972        8748  1 1 1.007 -0.007492
11            1   13    10    100   1000        130        1300       13000  1 1 1.078 -0.077881
12            1   14    11    121   1331        154        1694       18634  1 1 1.228 -0.228184
13            1    2     0      0      0          0           0           0  2 2 2.231 -0.230870
identical (  dsp$y-dsp$yHat, dsp$resid)
[1] TRUE

Getting the standard error of residuals

getME(model,"sigma") # standard error of residuals, same sigma(model)
[1] 1.068
sigma(model) # std.error of residuals <- this methods is preferred
[1] 1.068
# however
sd(dsp$resid) # not the same as sigma(model) = find out why
[1] 0.9335
identical (sigma(model),sd(dsp$resid)) # WHY?
[1] FALSE

7.7 Conditional values

Predictions form fixed effects only, no individual variability is used in calculation

# create object "coefs" for easy reference
coefs <- fixef(model)
# use fixed effects estimates to find conditional predictions??
dsp$yPar<-(
  (coefs["(Intercept)"])         +(coefs["agec"]*dsp$agec)
  +(coefs["timec"]*dsp$timec)    +(coefs["agec:timec"]*dsp$agec*dsp$timec)
  +(coefs["timec2"]*dsp$timec2)  +(coefs["agec:timec2"]*dsp$agec*dsp$timec2)
  +(coefs["timec3"]*dsp$timec3)  +(coefs["agec:timec3"]*dsp$agec*dsp$timec3)
)
head(dsp,13)
   X.Intercept. agec timec timec2 timec3 agec.timec agec.timec2 agec.timec3 id y  yHat     resid  yPar
1             1    3     0      0      0          0           0           0  1 1 2.766 -1.766010 2.966
2             1    4     1      1      1          4           4           4  1 6 2.431  3.568956 2.829
3             1    5     2      4      8         10          20          40  1 2 2.119 -0.119025 2.719
4             1    6     3      9     27         18          54         162  1 1 1.834 -0.834349 2.631
5             1    7     4     16     64         28         112         448  1 1 1.582 -0.581892 2.562
6             1    8     5     25    125         40         200        1000  1 1 1.367 -0.367016 2.512
7             1    9     6     36    216         54         324        1944  1 1 1.196 -0.195566 2.477
8             1   10     7     49    343         70         490        3430  1 1 1.074 -0.073873 2.458
9             1   11     8     64    512         88         704        5632  1 1 1.009 -0.008749 2.453
10            1   12     9     81    729        108         972        8748  1 1 1.007 -0.007492 2.462
11            1   13    10    100   1000        130        1300       13000  1 1 1.078 -0.077881 2.486
12            1   14    11    121   1331        154        1694       18634  1 1 1.228 -0.228184 2.525
13            1    2     0      0      0          0           0           0  2 2 2.231 -0.230870 3.099

[model_specification]